This report is the first to document and study the feasibility of automatically evaluating the quality of experimental literature investigating bio–nano interactions. The first step of this automatic evaluation is to isolate the Materials and Methods section. The goal is to later use this section alone to assess whether the nano-materials were characterised and to evaluate the quality of the articles.
This report contains preliminary analyses and an exploration of the data in the text corpus. The first goal of these analyses is to understand the structure of the texts in the corpus of articles and how the lemmas “material(s)” and “method(s)” relate to this corpus.
The second goal is to investigate how to detect the beginning of the “Materials and Methods” section. The main difficulty in identifying the start of the section is that these two words can also appear elsewhere in the text of the article (typically in “cf. Materials and Methods”).
The corpus has been created from the folder “Full Text dev set”, which contains 751 articles converted into txt format. The other articles are kept unseen in order to test, under “real-life conditions”, the efficacy of any tools developed later.
A few definitions to frame the problem:
Token: a word form or punctuation symbol. “,” and “(” are tokens, and so are “and” and “method”.
Lemma: the lemma or stem of a word form. For example, the tokens “Materials” and “materials” have the same lemma, “materials”.
Head: the head of the current word, given either as a token_id value or as zero.
A quick exploratory data analysis of the article Abrams, MT et al., 2010, suggested that the “materials” token from the Materials and Methods section heading has a specific property: its head_token_id is equal to zero, i.e. the token has no head other than itself and is the root of its sentence (cf. the example below). This suggested that the section titles of articles may share this property. This hypothesis is tested in the first part of this report and, for the lemmas “materials” and “material”, in a later section (Co-occurrences for materials and material when their head_token_id = 0).
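To make the property concrete, here is a minimal sketch on a hand-made annotation table. The data frame `toy` and its dependency parse are hypothetical (they are not taken from the corpus); only the column names follow the udpipe output used in this report. In this layout a head_token_id of 0 marks the root of the sentence.

```r
# Hypothetical annotation of two short sentences, in the udpipe column layout.
toy <- data.frame(
  doc_id        = "doc1",
  sentence_id   = c(1, 1, 1, 2, 2, 2, 2),
  token_id      = c(1, 2, 3, 1, 2, 3, 4),
  token         = c("Materials", "and", "Methods", "See", "Materials", "and", "Methods"),
  lemma         = c("materials", "and", "methods", "see", "materials", "and", "methods"),
  head_token_id = c(0, 3, 1, 0, 1, 4, 2),
  stringsAsFactors = FALSE
)

# Root tokens: the rows whose head_token_id is 0 (one per sentence here).
roots <- subset(toy, head_token_id == 0)
print(roots[, c("sentence_id", "token", "lemma")])

# Share of the "materials" tokens that are the root of their sentence.
is_materials <- toy$lemma == "materials"
share_root <- mean(toy$head_token_id[is_materials] == 0)
print(share_root)
```

In this toy example only the heading-like sentence has “materials” as its root; in the in-text mention the root is the verb.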
In the later sections, we try different criteria to isolate some occurrences of the lemmas “materials”, “material”, “methods” and “method”. We use a technique, co-occurrence analysis, to explore the surroundings of these lemmas in the text and to evaluate whether each criterion makes it possible to discriminate the beginning of the Materials and Methods section from the rest of the article.
This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. It is a good way to create informal reports describing data analysis projects as a web page, and a good way to mix code and description in a readable manner. There are even books in this format, ranging from Data Analysis for the Life Sciences to Text Mining with R: A Tidy Approach, so anybody can understand and reuse this work. This report is also code: it can be recompiled with new data (including another model for the annotation of the corpus).
library(udpipe)
library(lattice)
library(wordcloud)
library(igraph)
library(ggraph)
library(ggplot2)
library(dplyr)
The following lines load the corpus of text, already annotated and tokenized :
x <- readRDS(file = "annotation_partut.rds")
x <- as.data.frame(x)
length(unique(x$doc_id))
## [1] 751
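Here is the row inspected as an example of a token “materials” with head_token_id = 0. Note that, in the annotation loaded here, row 7467 prints the token “using” (head_token_id = 4), so this index does not point at the intended “materials” token: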
x[7467,]
## doc_id paragraph_id sentence_id
## 7467 doc1 602 824
## sentence
## 7467 The ethanol was removed using tangential flow filtration, followed by buffer exchange so that the particle is in 100% phosphate-buffered saline.
## token_id token lemma upos xpos feats head_token_id dep_rel
## 7467 5 using use VERB V VerbForm=Ger 4 advcl
## deps misc
## 7467 <NA> <NA>
Given the observation that, in “Materials and Methods”, the head_token_id was 0 for the token “Materials”, one idea was to explore which lemmas in the corpus most commonly have a head_token_id equal to zero.
The expected outcome of this analysis is to retrieve the usual section titles of scientific articles, like Abstract or Results, among the most common words. The goal is to assess whether this is a consistent property of section titles and to uncover potential synonyms of “Materials and Methods”, like “Experimental Section”.
stats <- subset(x, head_token_id == 0) #https://bnosac.github.io/udpipe/docs/doc7.html
stats <- txt_freq(x = stats$lemma)
stats$key <- factor(stats$key, levels = rev(stats$key))
barchart(key ~ freq, data = head(stats, 30), col = "cadetblue", main = "Most occurring words with Head_token_id = 0", xlab = "Freq")
Nonetheless, this assumption seems quite naive, as a lot of tokens have this property. Let’s filter for specific lemmas that correspond to usual section titles, like abstract or results:
stats<-stats %>% filter(key %in% c("material", "materials", "result", "results", "abstract", "introduction" , "method", "methods", "discussion", "references"))
stats$key <- factor(stats$key, levels = rev(stats$key))
barchart(key ~ freq, data = head(stats, 30), col = "cadetblue", main = "Count of lemma for usual sections name with Head_token_id = 0", xlab = "Freq")
stats
## key freq freq_pct
## 1 result 1781 0.3014531021
## 2 method 674 0.1140816344
## 3 material 406 0.0687197976
## 4 discussion 224 0.0379143711
## 5 introduction 169 0.0286050389
## 6 materials 50 0.0084630293
## 7 methods 48 0.0081245081
## 8 abstract 22 0.0037237329
## 9 results 9 0.0015233453
## 10 references 2 0.0003385212
Some section titles seem to have the aforementioned property. Nonetheless, the counts do not match the total number of articles in this corpus (751). Taking the token discussion as an example (224 occurrences with head_token_id = 0), either some articles do not have a discussion section or, more probably, the token discussion does not always have the property mentioned earlier. We can answer this question:
occurrences<-which(x$lemma=="discussion")
length(occurrences)
## [1] 899
length(unique(x[occurrences,]$doc_id))
## [1] 707
There are 899 occurrences of the word discussion in the whole corpus, spread over 707 articles, whereas only 224 occurrences have a head_token_id of zero. It therefore seems very likely that a head_token_id of zero alone is not sufficient to identify the tokens that are section titles.
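The same comparison can be made per document rather than per occurrence. The sketch below uses a small hypothetical table (`toy`, not the corpus) to list the documents that contain a lemma but never as a sentence root; on the real data, `toy` would be replaced by the annotation data frame.

```r
# Hypothetical annotation rows; only the columns needed here are included.
toy <- data.frame(
  doc_id        = c("doc1", "doc1", "doc2", "doc2", "doc3"),
  lemma         = c("discussion", "discussion", "discussion", "result", "discussion"),
  head_token_id = c(0, 5, 3, 0, 0),
  stringsAsFactors = FALSE
)

is_target <- toy$lemma == "discussion"

# Documents containing the lemma at all, and documents where it is a root at least once.
docs_with_lemma <- unique(toy$doc_id[is_target])
docs_with_root  <- unique(toy$doc_id[is_target & toy$head_token_id == 0])

# Documents that the "head_token_id == 0" criterion would miss.
docs_missed <- setdiff(docs_with_lemma, docs_with_root)
coverage <- length(docs_with_root) / length(docs_with_lemma)
print(docs_missed)
print(coverage)
```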
To explore the relationships of the lemmas “material(s)” and “method(s)” with the rest of the corpus, we can analyse which head tokens are the most recurrent for the lemmas “material” and “materials”, and likewise for “method” and “methods”. The goals of the analysis are:
grep_lemma_head_token_id <- function(index){
  # Return the lemma of the head of the token stored in row "index" of x,
  # where x is the data frame of annotations generated by udpipe.
  # x[index, ] returns a token and all the associated data: lemma, but also sentence and doc_id.
  occurrence <- x[index, ]
  head_token_id <- as.numeric(occurrence$head_token_id)
  # A head_token_id of 0 marks the root of the sentence: return the token's own lemma.
  if (head_token_id == 0) {
    return(occurrence$lemma)
  }
  # Rows of the same sentence in the same document, in token order.
  sentence_rows <- which(x$sentence_id == occurrence$sentence_id & x$doc_id == occurrence$doc_id)
  # The head is looked up by its position in the sentence.
  x[sentence_rows[head_token_id], ]$lemma
}
material_occurrences<-which(x$lemma=="material")
head_token_lemmas<-sapply(material_occurrences, grep_lemma_head_token_id)
mytable<-table(head_token_lemmas)
stats<-as.data.frame(mytable)
stats<-stats[order(stats$Freq, decreasing = TRUE),]
stats$key <- factor(stats$head_token_lemmas, levels = rev(stats$head_token_lemmas))
barchart(key ~ Freq, data = head(stats, 30), col = "cadetblue", main = "Most occurring lemma corresponding to the head_token_id \n for lemma material", xlab = "Freq")
occurrences<-which(x$lemma=="materials")
head_token_lemmas<-sapply(occurrences, grep_lemma_head_token_id)
mytable<-table(head_token_lemmas)
stats<-as.data.frame(mytable)
stats<-stats[order(stats$Freq, decreasing = TRUE),]
stats$key <- factor(stats$head_token_lemmas, levels = rev(stats$head_token_lemmas))
barchart(key ~ Freq, data = head(stats, 30), col = "cadetblue", main = "Most occurring lemma corresponding to the head_token_id \n for lemma materials", xlab = "Freq")
occurrences<-which(x$lemma=="method")
head_token_lemmas<-sapply(occurrences, grep_lemma_head_token_id)
mytable<-table(head_token_lemmas)
stats<-as.data.frame(mytable)
stats<-stats[order(stats$Freq, decreasing = TRUE),]
stats$key <- factor(stats$head_token_lemmas, levels = rev(stats$head_token_lemmas))
barchart(key ~ Freq, data = head(stats, 30), col = "cadetblue", main = "Most occurring lemma corresponding to the head_token_id \n for lemma method", xlab = "Freq")
occurrences<-which(x$lemma=="methods")
head_token_lemmas<-sapply(occurrences, grep_lemma_head_token_id)
mytable<-table(head_token_lemmas)
stats<-as.data.frame(mytable)
stats<-stats[order(stats$Freq, decreasing = TRUE),]
stats$key <- factor(stats$head_token_lemmas, levels = rev(stats$head_token_lemmas))
barchart(key ~ Freq, data = head(stats, 30), col = "cadetblue", main = "Most occurring lemma corresponding to the head_token_id \n for lemma methods", xlab = "Freq")
head(stats, 10)
## head_token_lemmas Freq key
## 49 material 72 material
## 55 methods 48 methods
## 7 and 47 and
## 26 described 23 described
## 45 j 21 j
## 86 Toxicol 18 Toxicol
## 25 describe 17 describe
## 73 question 15 question
## 68 preparation 14 preparation
## 78 see 14 see
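The lookup above scans the whole table once per occurrence. A vectorised variant is sketched below on a hypothetical table (`toy`); the helper `head_lemma` is not part of udpipe. It assumes that token_id identifies a token within its sentence, so it matches on token_id instead of on the position in the sentence, which may differ from the function above when some rows have non-numeric token ids.

```r
# Hypothetical annotation of one sentence in each of two documents.
toy <- data.frame(
  doc_id        = c("doc1", "doc1", "doc1", "doc2", "doc2", "doc2", "doc2"),
  sentence_id   = c(1, 1, 1, 1, 1, 1, 1),
  token_id      = c(1, 2, 3, 1, 2, 3, 4),
  lemma         = c("materials", "and", "methods", "see", "materials", "and", "methods"),
  head_token_id = c(0, 3, 1, 0, 1, 4, 2),
  stringsAsFactors = FALSE
)

head_lemma <- function(ann) {
  # One key per token, and the key of the token each row points to as its head.
  key_self <- paste(ann$doc_id, ann$sentence_id, ann$token_id, sep = "|")
  key_head <- paste(ann$doc_id, ann$sentence_id, ann$head_token_id, sep = "|")
  out <- ann$lemma[match(key_head, key_self)]
  # Root tokens (head_token_id == 0) keep their own lemma, as in the function above.
  is_root <- !is.na(ann$head_token_id) & ann$head_token_id == 0
  out[is_root] <- ann$lemma[is_root]
  out
}

toy$head_lemma <- head_lemma(toy)
# Head lemmas of every occurrence of "materials".
print(table(toy$head_lemma[toy$lemma == "materials"]))
```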
In the next sections we test different criteria to discriminate the lemmas “materials” and “material” inside the articles. The idea is to find a criterion that identifies the beginning of the “Materials and Methods” section.
Co-occurrence analysis shows how words are used either in the same sentence or next to each other. We use this approach to get a sense of the neighbourhood of the lemmas isolated by each criterion.
There are several types of co-occurrence analysis:

* Looking at which words are located in the same document, sentence or paragraph.
* Looking at which words are followed by another word.
* Looking at which words are in the neighbourhood of a word, i.e. follow it within a given number of skipped words (the skipgram).
See the documentation of the udpipe package for these three possible uses. We use the second approach, as it is the most relevant to our goal and the simplest to interpret. Different skipgram values can be used to get an idea of the more distant or more proximal neighbourhood.
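As an illustration of the second approach, the sketch below counts, in base R, how often one term is immediately followed by another in a flat vector of lemmas. The helper `count_followed_by` and the example vector are hypothetical; this only conveys the idea of “followed by” and is not the implementation used by udpipe. Like the analysis in this report, it works on a flat vector, so pairs that straddle a sentence boundary are counted too.

```r
count_followed_by <- function(terms) {
  # Fewer than two terms: there is no adjacent pair to count.
  if (length(terms) < 2) {
    return(data.frame(term1 = character(0), term2 = character(0), cooc = integer(0)))
  }
  # Each term paired with the term that immediately follows it.
  adjacent <- data.frame(
    term1 = terms[-length(terms)],
    term2 = terms[-1],
    stringsAsFactors = FALSE
  )
  adjacent$cooc <- 1L
  # Sum the occurrences of each ordered pair, most frequent first.
  counts <- aggregate(cooc ~ term1 + term2, data = adjacent, FUN = sum)
  counts <- counts[order(counts$cooc, decreasing = TRUE), ]
  rownames(counts) <- NULL
  counts
}

lemmas <- c("material", "and", "method", ".", "see", "material", "and", "method")
pair_counts <- count_followed_by(lemmas)
print(pair_counts)
```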
The two functions below are meant to save space in the document. The first one plots the word network, a common technique for visualising word co-occurrences, after keeping only the co-occurrences that involve the lemma of interest. The second one prints the corresponding rows of the co-occurrence table.
plot_cooccurrence <- function(stats, lemma, title){
  # Helper to save space and keep this R Markdown document readable:
  # keep the co-occurrences involving the lemma(s) of interest and plot the first 30 rows.
  stats <- stats %>% filter(term1 %in% c(lemma) | term2 %in% c(lemma))
  wordnetwork <- head(stats, 30)
  wordnetwork <- graph_from_data_frame(wordnetwork)
  ggraph(wordnetwork, layout = "fr") +
    geom_edge_link(aes(width = cooc, edge_alpha = cooc), edge_colour = "pink") +
    geom_node_text(aes(label = name), col = "blue", size = 5) +
    theme_graph(base_family = "Helvetica") +
    theme(legend.position = "none") +
    labs(title = title)
}
head_cooc <- function(stats, lemma){
  # Helper to save space: return the first 30 co-occurrences involving the lemma(s) of interest.
  stats <- stats %>% filter(term1 %in% c(lemma) | term2 %in% c(lemma))
  head(stats, 30)
}
stats <- cooccurrence(x = x$lemma, skipgram = 0)
Bigger skipgram values were not really relevant. Here we can simply read the rows of the data frame stats to see how many times each word follows another.
plot_cooccurrence(stats, lemma="materials", title="Co-occurrences for the lemma materials")
head_cooc(stats, lemma="materials")
## term1 term2 cooc
## 1 materials & 60
## 2 materials science 58
## 3 apply materials 54
## 4 of materials 42
## 5 materials research 35
## 6 Biomedical materials 33
## 7 . materials 30
## 8 and materials 19
## 9 / materials 14
## 10 materials / 14
## 11 methods materials 13
## 12 method materials 8
## 13 in materials 8
## 14 materials Chemistry 7
## 15 materials and 7
## 16 this materials 6
## 17 for materials 6
## 18 materials characterization 5
## 19 materials , 5
## 20 materials section 3
## 21 materials Science 3
## 22 materials inc. 3
## 23 materials commercial 3
## 24 materials engineering 2
## 25 , materials 2
## 26 bulk materials 2
## 27 reference materials 2
## 28 materials catalogue 2
## 29 various materials 2
## 30 materials technology 2
plot_cooccurrence(stats, lemma="material", title="Co-occurrences for the lemma material")
head_cooc(stats, lemma="material")
## term1 term2 cooc
## 1 material and 800
## 2 . material 513
## 3 material . 453
## 4 material , 391
## 5 material be 347
## 6 of material 246
## 7 the material 242
## 8 this material 203
## 9 test material 156
## 10 material in 155
## 11 material ( 154
## 12 material at 137
## 13 Supplementary material 109
## 14 supplementary material 105
## 15 material for 82
## 16 bulk material 71
## 17 material have 65
## 18 and material 61
## 19 material that 60
## 20 nanotube material 56
## 21 foreign material 53
## 22 material with 53
## 23 reference material 50
## 24 material : 44
## 25 t material 41
## 26 material to 39
## 27 material on 39
## 28 genetic material 39
## 29 in material 37
## 30 material the 33
plot_cooccurrence(stats, lemma="methods", title="Co-occurrences for the lemma methods")
head_cooc(stats, lemma="methods")
## term1 term2 cooc
## 1 and methods 151
## 2 methods . 52
## 3 in methods 45
## 4 . methods 39
## 5 methods , 35
## 6 Immunol methods 30
## 7 Mech methods 21
## 8 : methods 19
## 9 methods material 17
## 10 , methods 16
## 11 see methods 16
## 12 methods ) 15
## 13 methods materials 13
## 14 methods section 12
## 15 methods : 11
## 16 methods 2.1 9
## 17 methods preparation 7
## 18 methods test 7
## 19 methods the 7
## 20 alternative methods 7
## 21 methods Chemical 6
## 22 methods animal 6
## 23 methods and 6
## 24 methods For 6
## 25 methods Downloade 6
## 26 methods ( 5
## 27 the methods 5
## 28 methods in 5
## 29 methods Nanoparticle 5
## 30 [ methods 5
plot_cooccurrence(stats, lemma="method", title="Co-occurrences for the lemma method")
head_cooc(stats, lemma="method")
## term1 term2 cooc
## 1 and method 526
## 2 method . 507
## 3 method for 471
## 4 the method 448
## 5 . method 394
## 6 method of 317
## 7 method be 278
## 8 method , 254
## 9 method to 229
## 10 method ( 214
## 11 this method 151
## 12 method and 137
## 13 method 2.1 134
## 14 method use 130
## 15 method describe 126
## 16 method : 107
## 17 ) method 101
## 18 a method 94
## 19 test method 80
## 20 method [ 80
## 21 method in 79
## 22 method have 61
## 23 method ) 51
## 24 method as 51
## 25 analytic method 47
## 26 sensitive method 46
## 27 method with 44
## 28 method Mol 41
## 29 vitro method 39
## 30 method animal 38
plot_cooccurrence(stats, lemma=c("materials", "material", "methods", "method"), title="Co-occurrences for several lemmas: materials, material, methods, method")
head_cooc(stats, lemma=c("materials", "material", "methods", "method"))
## term1 term2 cooc
## 1 material and 800
## 2 and method 526
## 3 . material 513
## 4 method . 507
## 5 method for 471
## 6 material . 453
## 7 the method 448
## 8 . method 394
## 9 material , 391
## 10 material be 347
## 11 method of 317
## 12 method be 278
## 13 method , 254
## 14 of material 246
## 15 the material 242
## 16 method to 229
## 17 method ( 214
## 18 this material 203
## 19 test material 156
## 20 material in 155
## 21 material ( 154
## 22 this method 151
## 23 and methods 151
## 24 method and 137
## 25 material at 137
## 26 method 2.1 134
## 27 method use 130
## 28 method describe 126
## 29 Supplementary material 109
## 30 method : 107
As in the previous approach, we want to explore the relationships of the different lemmas with their neighbourhood in the corpus, but we restrict the analysis to sentences in which the lemma material or materials is its own head (head_token_id = 0).
Even if not all “Materials and Methods” section titles have a “materials” lemma with a head_token_id equal to zero, the converse could be true: such tokens might mostly be section titles.
Here, by restricting to the lemmas “materials” and “material” that have a head_token_id = 0, we can visualise their statistical association with other words and understand whether this subset of tokens really delimits the beginning of the “Materials and Methods” section.
The first function filters for sentences where the lemma material or materials is the head. The following lines calculate the co-occurrences and draw the plot as previously.
create_subset_corpus <- function(index){
  # Helper to build a subset of x for the part of the analysis:
  # "Co-occurrences for materials and material when their head_token_id = 0".
  # x[index, ] returns a token and all the associated data: lemma, but also sentence and doc_id,
  # where x is the data frame of annotations generated by udpipe.
  occurrence <- x[index, ]
  sentence_id <- occurrence$sentence_id
  doc_id <- occurrence$doc_id
  # If the head_token_id of the token is zero, return all the tokens of its sentence.
  head_token_id <- occurrence$head_token_id
  if (head_token_id == 0) {return(strip_corpus(doc_id, sentence_id))}
  # Otherwise return NULL, which rbind() ignores later on.
  return(NULL)
}
strip_corpus <- function(doc_id, sentence_id){
  # Return all the rows (tokens and lemmas) of one sentence of one document,
  # so that the co-occurrences of the words inside this sentence can be computed.
  sentence_id <- as.numeric(sentence_id)
  subset_article <- x[which(x$sentence_id == sentence_id & x$doc_id == doc_id), ]
  return(subset_article)
}
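The same selection can also be written as a single vectorised filter, sketched here on a hypothetical table (`toy`). The helper `sentences_with_root_lemma` is an illustration, not part of udpipe; it returns each matching sentence once.

```r
sentences_with_root_lemma <- function(ann, target) {
  # One key per sentence of each document.
  key <- paste(ann$doc_id, ann$sentence_id, sep = "|")
  # Tokens that carry the target lemma and are the root of their sentence.
  hit <- ann$lemma %in% target & !is.na(ann$head_token_id) & ann$head_token_id == 0
  # Keep every row of the sentences that contain such a token.
  ann[key %in% unique(key[hit]), , drop = FALSE]
}

# Hypothetical annotation: a heading-like sentence and an in-text mention.
toy <- data.frame(
  doc_id        = "doc1",
  sentence_id   = c(1, 1, 1, 2, 2, 2, 2),
  lemma         = c("materials", "and", "methods", "see", "materials", "and", "methods"),
  head_token_id = c(0, 3, 1, 0, 1, 4, 2),
  stringsAsFactors = FALSE
)

kept <- sentences_with_root_lemma(toy, "materials")
print(kept)
```

Only the first sentence is kept, because in the second one the root is “see”.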
occurrences<-which(x$lemma=="materials")
subset_corpus<-lapply(occurrences, create_subset_corpus) #lapply always returns a list, as do.call(rbind, ...) expects
subset_corpus<-do.call(rbind, subset_corpus)
stats <- cooccurrence(x = subset_corpus$lemma, skipgram = 0)
plot_cooccurrence(stats, lemma="materials", title="Co-occurrences for the lemma materials \n when its head_token_id is equal to 0")
head_cooc(stats, lemma="materials")
## term1 term2 cooc
## 1 apply materials 45
## 2 materials & 45
## 3 of materials 2
## 4 materials science 1
## 5 materials Science 1
## 6 methods materials 1
## 7 materials . 1
## 8 ; materials 1
## 9 materials , 1
## 10 Supplementary materials 1
## 11 materials available 1
plot_cooccurrence(stats, lemma=c("materials", "material", "methods", "method"), title="Co-occurrences for several lemmas: materials, material, methods, method, \n when head_token_id of lemma materials is equal to 0")
head_cooc(stats, lemma=c("materials", "material", "methods", "method"))
## term1 term2 cooc
## 1 apply materials 45
## 2 materials & 45
## 3 of materials 2
## 4 materials science 1
## 5 materials Science 1
## 6 Interface methods 1
## 7 methods materials 1
## 8 materials . 1
## 9 ; materials 1
## 10 materials , 1
## 11 Supplementary materials 1
## 12 materials available 1
occurrences<-which(x$lemma=="material")
subset_corpus<-lapply(occurrences, create_subset_corpus) #lapply always returns a list, as do.call(rbind, ...) expects
subset_corpus<-do.call(rbind, subset_corpus)
stats <- cooccurrence(x = subset_corpus$lemma, skipgram = 0)
plot_cooccurrence(stats, lemma="material", title="Co-occurrences for the lemma material \n when its head_token_id is equal to 0")
head_cooc(stats, lemma="material")
## term1 term2 cooc
## 1 material and 188
## 2 . material 139
## 3 material . 41
## 4 supplementary material 36
## 5 nanotube material 32
## 6 material material 30
## 7 method material 28
## 8 : material 26
## 9 material : 25
## 10 material for 16
## 11 test material 11
## 12 material in 9
## 13 material science 9
## 14 material , 8
## 15 ; material 7
## 16 material ( 5
## 17 Supplementary material 5
## 18 material to 5
## 19 material of 5
## 20 important material 5
## 21 material that 4
## 22 material with 4
## 23 Mesoporous material 4
## 24 composite material 4
## 25 from material 4
## 26 / material 4
## 27 , material 3
## 28 methods material 3
## 29 material & 3
## 30 material 5 3
plot_cooccurrence(stats, lemma=c("materials", "material", "methods", "method"), title="Co-occurrences for several lemmas: materials, material, methods, method, \n when head_token_id of lemma material is equal to 0")
head_cooc(stats, lemma=c("materials", "material", "methods", "method"))
## term1 term2 cooc
## 1 material and 188
## 2 and method 153
## 3 . material 139
## 4 material . 41
## 5 method 2.1 40
## 6 supplementary material 36
## 7 nanotube material 32
## 8 material material 30
## 9 method material 28
## 10 : material 26
## 11 material : 25
## 12 and methods 20
## 13 material for 16
## 14 method animal 13
## 15 test material 11
## 16 material in 9
## 17 material science 9
## 18 material , 8
## 19 ; material 7
## 20 method Chemical 7
## 21 method : 6
## 22 material ( 5
## 23 Supplementary material 5
## 24 material to 5
## 25 material of 5
## 26 important material 5
## 27 material that 4
## 28 method preparation 4
## 29 method Nanoparticles 4
## 30 material with 4
occurrences<-which(x$lemma=="methods")
subset_corpus<-lapply(occurrences, create_subset_corpus) #lapply always returns a list, as do.call(rbind, ...) expects
subset_corpus<-do.call(rbind, subset_corpus)
stats <- cooccurrence(x = subset_corpus$lemma, skipgram = 0)
plot_cooccurrence(stats, lemma="methods", title="Co-occurrences for the lemma methods \n when its head_token_id is equal to 0")
head_cooc(stats, lemma="methods")
## term1 term2 cooc
## 1 Immunol methods 16
## 2 . methods 13
## 3 methods : 7
## 4 : methods 4
## 5 methods for 3
## 6 Microbiol methods 2
## 7 experimental methods 2
## 8 methods 2.1 2
## 9 methods 159 2
## 10 Toxicol methods 2
## 11 methods 204 1
## 12 methods 24: 1
## 13 methods ( 1
## 14 methods 1983;65:55 1
## 15 Virol methods 1
## 16 methods 115:99 1
## 17 methods 63 1
## 18 methods experimental 1
## 19 methods 2008;73 1
## 20 methods , 1
## 21 methods 65 1
## 22 culture methods 1
## 23 methods 1988;11:15 1
## 24 methods . 1
## 25 methods 278 1
## 26 methods 2010;62 1
## 27 methods 78 1
## 28 methods 101 1
## 29 methods 62 1
## 30 methods 2011 1
plot_cooccurrence(stats, lemma=c("materials", "material", "methods", "method"), title="Co-occurrences for several lemmas: materials, material, methods, method, \n when head_token_id of lemma methods is equal to 0")
head_cooc(stats, lemma=c("materials", "material", "methods", "method"))
## term1 term2 cooc
## 1 Immunol methods 16
## 2 . methods 13
## 3 methods : 7
## 4 : methods 4
## 5 methods for 3
## 6 Microbiol methods 2
## 7 experimental methods 2
## 8 methods 2.1 2
## 9 methods 159 2
## 10 Toxicol methods 2
## 11 methods 204 1
## 12 methods 24: 1
## 13 methods ( 1
## 14 methods 1983;65:55 1
## 15 Virol methods 1
## 16 methods 115:99 1
## 17 methods 63 1
## 18 methods experimental 1
## 19 methods 2008;73 1
## 20 methods , 1
## 21 t material 1
## 22 material , 1
## 23 methods 65 1
## 24 culture methods 1
## 25 methods 1988;11:15 1
## 26 methods . 1
## 27 methods 278 1
## 28 methods 2010;62 1
## 29 methods 78 1
## 30 methods 101 1
occurrences<-which(x$lemma=="method")
subset_corpus<-lapply(occurrences, create_subset_corpus) #lapply always returns a list, as do.call(rbind, ...) expects
subset_corpus<-do.call(rbind, subset_corpus)
stats <- cooccurrence(x = subset_corpus$lemma, skipgram = 0)
plot_cooccurrence(stats, lemma="method", title="Co-occurrences for the lemma method \n when its head_token_id is equal to 0")
head_cooc(stats, lemma="method")
## term1 term2 cooc
## 1 . method 195
## 2 method for 153
## 3 method : 79
## 4 : method 63
## 5 method to 60
## 6 method Mol 40
## 7 method . 38
## 8 method Enzymol 29
## 9 sensitive method 27
## 10 method in 20
## 11 method of 19
## 12 method method 18
## 13 method and 16
## 14 method ( 15
## 15 the method 15
## 16 ; method 14
## 17 a method 12
## 18 standard method 10
## 19 ) method 10
## 20 and method 9
## 21 method that 8
## 22 easy method 8
## 23 analytic method 8
## 24 & method 8
## 25 vitro method 8
## 26 assay method 8
## 27 , method 7
## 28 nanotoxicity method 7
## 29 revise method 6
## 30 method , 6
plot_cooccurrence(stats, lemma=c("materials", "material", "methods", "method"), title="Co-occurrences for several lemmas: materials, material, methods, method, \n when head_token_id of lemma method is equal to 0")
head_cooc(stats, lemma=c("materials", "material", "methods", "method"))
## term1 term2 cooc
## 1 . method 195
## 2 method for 153
## 3 method : 79
## 4 : method 63
## 5 method to 60
## 6 method Mol 40
## 7 method . 38
## 8 method Enzymol 29
## 9 sensitive method 27
## 10 method in 20
## 11 method of 19
## 12 method method 18
## 13 method and 16
## 14 method ( 15
## 15 the method 15
## 16 ; method 14
## 17 a method 12
## 18 standard method 10
## 19 ) method 10
## 20 . material 9
## 21 material and 9
## 22 and method 9
## 23 method that 8
## 24 easy method 8
## 25 analytic method 8
## 26 & method 8
## 27 vitro method 8
## 28 assay method 8
## 29 , method 7
## 30 nanotoxicity method 7
We could assume that the last occurrence of the lemma “materials” in an article corresponds to the section title “Materials and Methods”. As before, we use co-occurrences to see how words are connected to the last occurrence of “materials” in each document, and how often it corresponds to a “Materials and Methods” section.
The following function, together with strip_corpus, selects the last occurrence of a word in a document and retrieves the rows of its sentence. A graph showing the connections between words for this subset of sentences is then plotted.
create_subset_corpus_last_lemmas <- function(index){
  # Helper to build a subset of x for the part of the analysis:
  # "Co-occurrences for materials and material when it is the last lemma of the document".
  # x[index, ] returns a token and all the associated data: lemma, but also sentence and doc_id,
  # where x is the data frame of annotations generated by udpipe.
  occurrence <- x[index, ]
  sentence_id <- occurrence$sentence_id
  doc_id <- occurrence$doc_id
  lemma <- occurrence$lemma
  # Rows of all the occurrences of this lemma in the same document; the last one is kept.
  occurrences_in_doc <- which(x$doc_id == doc_id & x$lemma == lemma)
  last_occurrence <- occurrences_in_doc[length(occurrences_in_doc)]
  if (last_occurrence == index) {return(strip_corpus(doc_id, sentence_id))}
  # Otherwise return NULL, which rbind() ignores later on.
  return(NULL)
}
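The row indices of the last occurrences can also be obtained in one pass with `duplicated(..., fromLast = TRUE)`. The sketch below runs on a hypothetical table (`toy`); the helper `last_occurrence_rows` is an illustration and assumes the rows are in document order, as in the udpipe output.

```r
last_occurrence_rows <- function(ann, target) {
  # Row indices of every occurrence of the target lemma.
  idx <- which(ann$lemma == target)
  # Scanning from the end, the first row met for each document is its last occurrence.
  idx[!duplicated(ann$doc_id[idx], fromLast = TRUE)]
}

# Hypothetical annotation rows; only the columns needed here are included.
toy <- data.frame(
  doc_id      = c("doc1", "doc1", "doc1", "doc2", "doc2"),
  sentence_id = c(1, 4, 9, 2, 7),
  lemma       = c("materials", "materials", "result", "materials", "materials"),
  stringsAsFactors = FALSE
)

last_rows <- last_occurrence_rows(toy, "materials")
print(toy[last_rows, ])
```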
occurrences<-which(x$lemma=="materials")
subset_corpus<-lapply(occurrences, create_subset_corpus_last_lemmas) #lapply always returns a list, as do.call(rbind, ...) expects
subset_corpus<-do.call(rbind, subset_corpus)
stats <- cooccurrence(x = subset_corpus$lemma, skipgram = 0)
plot_cooccurrence(stats, lemma="materials", title="Co-occurrences for the lemma materials \n when it is the last lemma of the document")
head_cooc(stats, lemma="materials")
## term1 term2 cooc
## 1 materials science 39
## 2 . materials 26
## 3 of materials 26
## 4 and materials 14
## 5 / materials 13
## 6 materials / 13
## 7 materials & 12
## 8 methods materials 11
## 9 materials research 8
## 10 method materials 7
## 11 apply materials 6
## 12 for materials 6
## 13 Biomedical materials 5
## 14 materials characterization 5
## 15 materials , 5
## 16 materials Chemistry 4
## 17 materials and 4
## 18 this materials 3
## 19 materials section 3
## 20 materials inc. 3
## 21 in materials 3
## 22 materials commercial 3
## 23 bulk materials 2
## 24 various materials 2
## 25 materials Science 2
## 26 & materials 2
## 27 Supplementary materials 2
## 28 : materials 2
## 29 , materials 2
## 30 \u0084\u0084 materials 2
plot_cooccurrence(stats, lemma=c("materials", "material", "methods", "method"), title="Co-occurrences for several lemmas: materials, material, methods, method, \n when materials is the last lemma of the document")
head_cooc(stats, lemma=c("materials", "material", "methods", "method"))
## term1 term2 cooc
## 1 materials science 39
## 2 . materials 26
## 3 of materials 26
## 4 material and 24
## 5 and method 18
## 6 . material 17
## 7 method 2.1 17
## 8 and materials 14
## 9 / materials 13
## 10 materials / 13
## 11 materials & 12
## 12 methods materials 11
## 13 materials research 8
## 14 method materials 7
## 15 apply materials 6
## 16 for materials 6
## 17 and methods 6
## 18 . method 5
## 19 Biomedical materials 5
## 20 materials characterization 5
## 21 & method 5
## 22 materials , 5
## 23 materials Chemistry 4
## 24 . methods 4
## 25 materials and 4
## 26 : material 4
## 27 this materials 3
## 28 materials section 3
## 29 functional material 3
## 30 materials inc. 3
occurrences<-which(x$lemma=="material")
subset_corpus<-lapply(occurrences, create_subset_corpus_last_lemmas) #lapply always returns a list, as do.call(rbind, ...) expects
subset_corpus<-do.call(rbind, subset_corpus)
stats <- cooccurrence(x = subset_corpus$lemma, skipgram = 0)
plot_cooccurrence(stats, lemma="material", title="Co-occurrences for the lemma material \n when it is the last lemma of the document")
head_cooc(stats, lemma="material")
## term1 term2 cooc
## 1 material and 157
## 2 . material 103
## 3 material . 102
## 4 of material 96
## 5 material at 83
## 6 material be 47
## 7 Supplementary material 46
## 8 material , 39
## 9 the material 38
## 10 this material 29
## 11 material for 23
## 12 material available 22
## 13 material in 19
## 14 nanotube material 19
## 15 material : 17
## 16 material ( 13
## 17 supplementary material 13
## 18 method material 11
## 19 genetic material 11
## 20 and material 11
## 21 t material 11
## 22 nanosized material 10
## 23 reference material 9
## 24 : material 9
## 25 in material 8
## 26 material characterization 8
## 27 material refer 8
## 28 section material 7
## 29 material as 7
## 30 , material 7
plot_cooccurrence(stats, lemma=c("materials", "material", "methods", "method"), title="Co-occurrences for several lemmas: materials, material, methods, method, \n when material is the last lemma of the document")
head_cooc(stats, lemma=c("materials", "material", "methods", "method"))
## term1 term2 cooc
## 1 material and 157
## 2 . material 103
## 3 and method 102
## 4 material . 102
## 5 of material 96
## 6 material at 83
## 7 material be 47
## 8 Supplementary material 46
## 9 material , 39
## 10 the material 38
## 11 this material 29
## 12 and methods 26
## 13 method 2.1 25
## 14 material for 23
## 15 material available 22
## 16 material in 19
## 17 nanotube material 19
## 18 material : 17
## 19 material ( 13
## 20 supplementary material 13
## 21 method material 11
## 22 genetic material 11
## 23 and material 11
## 24 t material 11
## 25 nanosized material 10
## 26 method animal 9
## 27 reference material 9
## 28 : material 9
## 29 methods , 9
## 30 in material 8
create_subset_corpus <- function(index, target){
  # Helper to build a subset of x for the part of the analysis:
  # "Co-occurrences for lemma materials and material when they are the first lemma of a sentence".
  # x[index, ] returns a token and all the associated data: lemma, but also sentence and doc_id,
  # where x is the data frame of annotations generated by udpipe.
  occurrence <- x[index, ]
  sentence_id <- occurrence$sentence_id
  doc_id <- occurrence$doc_id
  # Look up the first lemma of the sentence in the right document.
  first_lemma <- x[which(x$sentence_id == sentence_id & x$doc_id == doc_id)[1], ]$lemma
  if (first_lemma == target) {return(strip_corpus(doc_id, sentence_id))}
  # Otherwise return NULL, which rbind() ignores later on.
  return(NULL)
}
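Selecting the sentences that start with a given lemma can also be sketched as one vectorised filter. The table `toy` and the helper `sentences_starting_with` are hypothetical; the sketch assumes the rows are in reading order, so that the first row of each (doc_id, sentence_id) pair is the first token of the sentence, and it returns each sentence once.

```r
sentences_starting_with <- function(ann, target) {
  # One key per sentence of each document.
  key <- paste(ann$doc_id, ann$sentence_id, sep = "|")
  # The first row seen for each key is the first token of that sentence.
  first_row <- !duplicated(key)
  starts_with_target <- key[first_row & ann$lemma %in% target]
  ann[key %in% starts_with_target, , drop = FALSE]
}

# Hypothetical annotation: a heading-like sentence and an in-text mention.
toy <- data.frame(
  doc_id      = "doc1",
  sentence_id = c(1, 1, 1, 2, 2, 2, 2),
  lemma       = c("materials", "and", "methods", "see", "materials", "and", "methods"),
  stringsAsFactors = FALSE
)

kept_first <- sentences_starting_with(toy, "materials")
print(kept_first)
```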
occurrences<-which(x$lemma=="materials")
subset_corpus<-lapply(occurrences, create_subset_corpus, target="materials") #lapply always returns a list, as do.call(rbind, ...) expects
subset_corpus<-do.call(rbind, subset_corpus)
stats <- cooccurrence(x = subset_corpus$lemma, skipgram = 0)
plot_cooccurrence(stats, lemma="materials", title="Co-occurrences for lemma materials when it is the first lemma of a sentence")
head_cooc(stats, lemma="materials")
## term1 term2 cooc
## 1 materials & 7
## 2 . materials 6
## 3 \u0084\u0084 materials 4
## 4 materials science 2
## 5 : materials 2
## 6 materials Poloxamers 2
## 7 materials today 1
## 8 materials C60 1
## 9 materials , 1
## 10 ) materials 1
plot_cooccurrence(stats, lemma=c("materials", "material", "methods", "method"), title="Co-occurrences for several lemmas: materials, material, methods, method, \n when lemma materials is the first lemma of a sentence")
head_cooc(stats, lemma=c("materials", "material", "methods", "method"))
## term1 term2 cooc
## 1 materials & 7
## 2 . materials 6
## 3 & method 5
## 4 method \u0084\u0084 4
## 5 \u0084\u0084 materials 4
## 6 materials science 2
## 7 : materials 2
## 8 materials Poloxamers 2
## 9 materials today 1
## 10 method : 1
## 11 materials C60 1
## 12 materials , 1
## 13 ) materials 1
occurrences<-which(x$lemma=="material")
subset_corpus<-lapply(occurrences, create_subset_corpus, target="material") #lapply always returns a list, as do.call(rbind, ...) expects
subset_corpus<-do.call(rbind, subset_corpus)
stats <- cooccurrence(x = subset_corpus$lemma, skipgram = 0)
plot_cooccurrence(stats, lemma="material", title="Co-occurrences for lemma material when it is the first lemma of a sentence")
head_cooc(stats, lemma="material")
## term1 term2 cooc
## 1 material and 454
## 2 . material 309
## 3 method material 91
## 4 methods material 53
## 5 : material 33
## 6 , material 24
## 7 material . 23
## 8 material material 22
## 9 ; material 20
## 10 material , 12
## 11 test material 10
## 12 material science 10
## 13 material be 9
## 14 material & 7
## 15 material in 7
## 16 material on 6
## 17 animal material 6
## 18 material the 6
## 19 Amorphous material 6
## 20 characterization material 5
## 21 ) material 5
## 22 material Chitosan 5
## 23 material for 4
## 24 Test material 4
## 25 Nano material 4
## 26 material Ptx 4
## 27 material Pristine 4
## 28 663 material 4
## 29 material 5 4
## 30 Chemical material 3
plot_cooccurrence(stats, lemma=c("materials", "material", "methods", "method"), title="Co-occurrences for several lemmas: materials, material, methods, method, \n when lemma material is the first lemma of a sentence")
head_cooc(stats, lemma=c("materials", "material", "methods", "method"))
## term1 term2 cooc
## 1 material and 454
## 2 . material 309
## 3 and method 297
## 4 and methods 135
## 5 method material 91
## 6 methods material 53
## 7 method 2.1 34
## 8 : material 33
## 9 method animal 27
## 10 , material 24
## 11 material . 23
## 12 material material 22
## 13 ; material 20
## 14 method Chemical 16
## 15 method characterization 13
## 16 material , 12
## 17 test material 10
## 18 material science 10
## 19 material be 9
## 20 method Synthesis 8
## 21 method : 7
## 22 material & 7
## 23 methods preparation 7
## 24 material in 7
## 25 method preparation 6
## 26 material on 6
## 27 animal material 6
## 28 methods test 6
## 29 material the 6
## 30 Amorphous material 6